{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Running LDA\n",
    "\n",
    "Running a topic model on (roughly) the 30,000 most common words in the **genremeta** corpus, as part of research for \"The Historical Significance of Textual Distance.\"\n",
    "\n",
    "This was the first time I had used LDA in scikit-learn. I was guided by Aneesha Bakharia's post on [\"Topic Modeling with Scikit-Learn,\"](https://medium.com/mlreview/topic-modeling-with-scikit-learn-e80d33668730) and borrowed some code snippets."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.feature_extraction.text import CountVectorizer\n",
    "from sklearn.decomposition import LatentDirichletAllocation\n",
    "import csv\n",
    "from collections import Counter\n",
    "import numpy as np\n",
    "import pandas as pd"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Data preparation\n",
    "\n",
    "We begin by counting \"document frequencies\" for words in the corpus. Then we save a vocabulary of the 30,000 words with highest doc frequency."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "# count doc frequencies\n",
    "\n",
    "vocabcount = Counter()\n",
    "\n",
    "with open('../parsejsons/rawcounts.tsv', encoding = 'utf-8') as f:\n",
    "    header = False\n",
    "    for line in f:\n",
    "        if not header:\n",
    "            header = True\n",
    "        else:\n",
    "            row = line.strip().split('\\t')\n",
    "            word = row[1]\n",
    "            vocabcount[word] += 1 "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In reality these words are already tokenized. But to take advantage of the sklearn tokenizer and avoid worries about data format I create pretend \"documents\" and allow the CountVectorizer to recount them."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [],
   "source": [
    "vocab = set([x[0] for x in vocabcount.most_common(30000)])\n",
    "\n",
    "docs = []\n",
    "\n",
    "with open('../parsejsons/rawcounts.tsv', encoding = 'utf-8') as f:\n",
    "    header = False\n",
    "    doc = []\n",
    "    lastdoc = ''\n",
    "    ctr = 0\n",
    "    for line in f:\n",
    "        if not header:\n",
    "            header = True\n",
    "            \n",
    "        else:\n",
    "            row = line.strip().split('\\t')\n",
    "            docid = row[0]\n",
    "            word = row[1]\n",
    "            count = int(row[2])\n",
    "            if lastdoc == '':\n",
    "                lastdoc = docid\n",
    "            elif docid != lastdoc:\n",
    "                docs.append(' '.join(doc))\n",
    "                doc = []\n",
    "                lastdoc = docid\n",
    "                ctr += 1\n",
    "            if word in vocab:\n",
    "                doc.extend([word] * count)          "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [],
   "source": [
    "tf_vectorizer = CountVectorizer(max_df=0.95, min_df=2, max_features=30000, stop_words='english')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [],
   "source": [
    "tf = tf_vectorizer.fit_transform(docs)\n",
    "tf_feature_names = tf_vectorizer.get_feature_names()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['10', '10th', '11', '12', '12th', '13th', '14th', '15th', '16th', '17th']"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "tf_feature_names[0:10]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "28443\n"
     ]
    }
   ],
   "source": [
    "print(len(tf_feature_names))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The actual length of the vocabulary after discarding stop words, etc, is 28,443 tokens.\n",
    "\n",
    "### Performing LDA\n",
    "\n",
    "This took 5-6 hours, even running multicore."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [],
   "source": [
    "no_topics = 100\n",
    "lda = LatentDirichletAllocation(n_topics=no_topics, max_iter=50, learning_method='online', learning_offset=50.,random_state=0).fit(tf)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [],
   "source": [
    "def display_topics(model, feature_names, no_top_words):\n",
    "    for topic_idx, topic in enumerate(model.components_):\n",
    "        print(\"Topic %d:\" % (topic_idx))\n",
    "        print(\" \".join([feature_names[i]\n",
    "                        for i in topic.argsort()[:-no_top_words - 1:-1]]))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### display the topics\n",
    "\n",
    "Relying heavily on Bakharia's code right here."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Topic 0:\n",
      "ve sir mr doctor mrs inside ca brother maybe sort\n",
      "Topic 1:\n",
      "ve ca mr maybe inside city sir car miss street\n",
      "Topic 2:\n",
      "ve mr mrs maybe miss god sir ca car inside\n",
      "Topic 3:\n",
      "mr ve mrs sir street god inside maybe miss road\n",
      "Topic 4:\n",
      "ve mr maybe inside car sir book mrs ca pulled\n",
      "Topic 5:\n",
      "mr mrs miss street sir fellow sort ve shall lady\n",
      "Topic 6:\n",
      "ve ca mr god miss mrs inside car maybe brother\n",
      "Topic 7:\n",
      "san el la spanish del litde sal nat murphy fay\n",
      "Topic 8:\n",
      "ve mr mrs ca god maybe street wo inside sea\n",
      "Topic 9:\n",
      "grant beth miller junior daddy momma auntie laurel jenkins faith\n",
      "Topic 10:\n",
      "ve mr god ca mrs lot inside maybe car sir\n",
      "Topic 11:\n",
      "sea boat ship captain island deck shore beach fish land\n",
      "Topic 12:\n",
      "ve mr car mrs street ca maybe prince floor lot\n",
      "Topic 13:\n",
      "ve maybe ca mr pulled god king street doctor sea\n",
      "Topic 14:\n",
      "king prince court lord brother evans apartment pineapple yellow terrible\n",
      "Topic 15:\n",
      "ai em ye ve folks goin dat yer ca er\n",
      "Topic 16:\n",
      "finn ron sunny dixon pirates bliss conan pirate sol balloon\n",
      "Topic 17:\n",
      "doctor ve ca maybe mr god inside king miss lot\n",
      "Topic 18:\n",
      "ve mr ca miss maybe god road car mrs sir\n",
      "Topic 19:\n",
      "ve car maybe mr inside god ca street pulled ship\n",
      "Topic 20:\n",
      "ve mr mrs maybe god inside street ca car town\n",
      "Topic 21:\n",
      "ve ca mr lot inside mrs god sea shall road\n",
      "Topic 22:\n",
      "ve ca ray maybe mr car god shall school inside\n",
      "Topic 23:\n",
      "darkness beneath human stone cried sky figure appeared shadow grew\n",
      "Topic 24:\n",
      "king ve mr ca god sea city inside shall car\n",
      "Topic 25:\n",
      "ve ca mr car mrs miss god maybe street wo\n",
      "Topic 26:\n",
      "ve car police ca office nodded smiled probably shook wo\n",
      "Topic 27:\n",
      "ve mr car ca maybe city floor street pulled inside\n",
      "Topic 28:\n",
      "doctor ve mr mrs ship prince maybe ca sea god\n",
      "Topic 29:\n",
      "ve ca mr god mrs miss school probably inside maybe\n",
      "Topic 30:\n",
      "ve ca mr god maybe inside car king lot lord\n",
      "Topic 31:\n",
      "ve maybe mr car god ca probably inside smiled city\n",
      "Topic 32:\n",
      "ed hank moore watson billie mo hawkins gates campbell bo\n",
      "Topic 33:\n",
      "ve mr ca god wo maybe inside floor car city\n",
      "Topic 34:\n",
      "mr ve shall mrs god maybe son miss lady inside\n",
      "Topic 35:\n",
      "mr ve ca miss car maybe god inside dayoftheweek wo\n",
      "Topic 36:\n",
      "mark ve ye mum gloria sean ok bloody ah ken\n",
      "Topic 37:\n",
      "mr judge mrs court jury honour lawyer abe law witness\n",
      "Topic 38:\n",
      "ve ca takes inside goes asks turns skin husband tea\n",
      "Topic 39:\n",
      "mr maybe ve ca miss street mrs inside wo god\n",
      "Topic 40:\n",
      "ve ca maybe car mr lot miss pierce wo school\n",
      "Topic 41:\n",
      "ve mr ca floor mrs maybe miss street chair inside\n",
      "Topic 42:\n",
      "ve maybe ca god mr mrs car inside son lot\n",
      "Topic 43:\n",
      "sonny burke mick bud baxter robinson tom bel butler chin\n",
      "Topic 44:\n",
      "nodded pulled ve inside fingers stared shook smiled shoulder floor\n",
      "Topic 45:\n",
      "ve ca maybe miss god road floor inside mr city\n",
      "Topic 46:\n",
      "mr ve mrs ca lady miss street maybe sir inside\n",
      "Topic 47:\n",
      "car maybe kitchen street floor school glass town ve pulled\n",
      "Topic 48:\n",
      "god christian soul stone hell heaven evil satan luna maybe\n",
      "Topic 49:\n",
      "ve mr mrs ca god lot maybe inside car king\n",
      "Topic 50:\n",
      "ve god mr ca mrs maybe miss car king inside\n",
      "Topic 51:\n",
      "ve mr miss ca mrs sir god car banks street\n",
      "Topic 52:\n",
      "general war captain army colonel soldiers officer enemy lieutenant officers\n",
      "Topic 53:\n",
      "princess king shall queen cried son till daughter gold tree\n",
      "Topic 54:\n",
      "ve mr ca mrs god miss wo shall son hamilton\n",
      "Topic 55:\n",
      "ve god ca mr car office mrs inside lot maybe\n",
      "Topic 56:\n",
      "ve war american office state city decided school public finally\n",
      "Topic 57:\n",
      "cat magic power spell wizard queen tiger magician sword song\n",
      "Topic 58:\n",
      "mrs miss lady dear mr shall husband ve suppose letter\n",
      "Topic 59:\n",
      "ve mr maybe car ca mrs floor god street inside\n",
      "Topic 60:\n",
      "bird birds pigeon pigeons spleen gun animal sky brunette alley\n",
      "Topic 61:\n",
      "human space ve ship planet ca science machine power thousand\n",
      "Topic 62:\n",
      "shall thou replied lady lord manner thy nature thee till\n",
      "Topic 63:\n",
      "ben madame monsieur french lily la st doctor victor le\n",
      "Topic 64:\n",
      "horse river horses road land camp indian town trees mountain\n",
      "Topic 65:\n",
      "ray earl harper flora glen pierce amber star holt cliff\n",
      "Topic 66:\n",
      "ve mr ca mrs maybe car god king inside miss\n",
      "Topic 67:\n",
      "ve mr miss ca car sir street maybe son school\n",
      "Topic 68:\n",
      "ve god inside city king ca mr maybe car sea\n",
      "Topic 69:\n",
      "ve mr miss sir shall ca maybe mrs god car\n",
      "Topic 70:\n",
      "ve ca mr car maybe miss inside street finally sort\n",
      "Topic 71:\n",
      "mr ve god sir miss maybe ca inside shall wo\n",
      "Topic 72:\n",
      "sir castle count master chapel hall english hyde arch shall\n",
      "Topic 73:\n",
      "ve inside ca maybe mr miss car city floor lot\n",
      "Topic 74:\n",
      "ve mr maybe god ca inside mrs shall lady miss\n",
      "Topic 75:\n",
      "ve mr car miss floor inside ca pulled sir maybe\n",
      "Topic 76:\n",
      "ve ca god banks mr car king floor inside maybe\n",
      "Topic 77:\n",
      "ve president car phone office security inside gun maybe area\n",
      "Topic 78:\n",
      "village tree forest chief food kill animals trees hut animal\n",
      "Topic 79:\n",
      "uncle aunt mama school baby sister brother boys girls papa\n",
      "Topic 80:\n",
      "ve mr shall mrs sir ca lady replied god city\n",
      "Topic 81:\n",
      "ve maybe mr ca floor car lot inside god street\n",
      "Topic 82:\n",
      "write writing book professor stories wrote letter written writer letters\n",
      "Topic 83:\n",
      "lord city son god sword master king priest brother lady\n",
      "Topic 84:\n",
      "ve ca doctor maybe boat street pulled smiled car sea\n",
      "Topic 85:\n",
      "ve ca mr god maybe city car sir mrs wo\n",
      "Topic 86:\n",
      "ve mr ca miss mrs office school maybe inside street\n",
      "Topic 87:\n",
      "ve god mr mrs car maybe road ca school street\n",
      "Topic 88:\n",
      "mr ve ca mrs car god floor maybe inside phone\n",
      "Topic 89:\n",
      "ve king maybe mr god shall ca dayoftheweek sea romannumeral\n",
      "Topic 90:\n",
      "ve ca mr mrs miss car probably god floor inside\n",
      "Topic 91:\n",
      "maybe ve okay yeah phone ca shit lot car kids\n",
      "Topic 92:\n",
      "don ﬁrst ﬁnd ﬁre won ﬁve ﬂoor ﬁne bert ted\n",
      "Topic 93:\n",
      "ve mr maybe ca car inside god wo glass street\n",
      "Topic 94:\n",
      "mr ve mrs ca miss romannumeral street city dayoftheweek sir\n",
      "Topic 95:\n",
      "ve god ca maybe mr car city miss street lucky\n",
      "Topic 96:\n",
      "belle masters ferris producer blind mrs news sterile blindness lawyer\n",
      "Topic 97:\n",
      "romannumeral dr book smith page books novel history edition english\n",
      "Topic 98:\n",
      "ve god mr mrs sir ca maybe miss lot road\n",
      "Topic 99:\n",
      "abbot vane ve mrs ca mr lord miss shoulder god\n"
     ]
    }
   ],
   "source": [
    "display_topics(lda, tf_feature_names, 10)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [],
   "source": [
    "doc_topic_dist_unnormalized = np.matrix(lda.transform(tf))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [],
   "source": [
    "doc_topic_dist = doc_topic_dist_unnormalized/doc_topic_dist_unnormalized.sum(axis=1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(6845, 100)"
      ]
     },
     "execution_count": 23,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "doc_topic_dist.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {},
   "outputs": [],
   "source": [
    "doc_topic_dist.tofile('doc_topic_dist.binary')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### pair docids with rows of the doc_topic_dist and write to file\n",
    "\n",
    "first let me count documents to make sure they line up"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "6846\n"
     ]
    }
   ],
   "source": [
    "docids = []\n",
    "docset = set()\n",
    "\n",
    "with open('../parsejsons/rawcounts.tsv', encoding = 'utf-8') as f:\n",
    "    header = False\n",
    "    for line in f:\n",
    "        if not header:\n",
    "            header = True\n",
    "        else:\n",
    "            row = line.strip().split('\\t')\n",
    "            docid = row[0]\n",
    "            if docid not in docset:\n",
    "                docids.append(docid)\n",
    "                docset.add(docid)\n",
    "print(len(docids))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Ah. Now I realize that the code I used to read documents missed the last one! LDA takes too long to run to re-do this; I'm going to just cut the last doc, which turns out to be a single volume in randomA. We won't miss it."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'uc1.b4357818'"
      ]
     },
     "execution_count": 26,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "docids[-1]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(6845, 100)"
      ]
     },
     "execution_count": 30,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "docdist = pd.DataFrame(doc_topic_dist, index = docids[0: -1])\n",
    "docdist.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {},
   "outputs": [],
   "source": [
    "docdist.to_csv('doc_topic_distribution.csv', index_label = 'docid')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
